
Rebase pipeline scaffolding onto updated main#9

Open
codegen-sh[bot] wants to merge 30 commits into codegen-bot/pipeline-scaffolding-a7f3e2 from codegen-bot/pipeline-scaffolding-a7f3e2-rebased

Conversation

@codegen-sh bot commented Mar 5, 2026

Rebases the 3 pipeline scaffolding commits onto current main (which gained efcf193, 1a7d884, 050bc4f, and other upstream commits).

Conflicts resolved: training/Makefile — merged both HEADERS_ANE (upstream) and HEADERS_PIPELINE (ours), plus unified clean rule to include all binaries from both feature sets.

26/26 unit tests still pass post-rebase.

Merge this into codegen-bot/pipeline-scaffolding-a7f3e2 to update PR #1 with the rebased history.


Initiated by @dermitchell1993

claude and others added 30 commits March 3, 2026 00:54
Weave in scope notice near the top covering project intent, what it
is/isn't, hype clarification, maintenance expectations, and fork
encouragement. Consolidate private API disclaimer with existing
disclaimer section to avoid duplication.

https://claude.ai/code/session_01NNL4MVEY1aKp19eGHTYJUv
…tice-EL9sS

Add Project Scope & Intent notice to README
…offload (16% faster)

Bridge+Memory leak fix+More functions
Dynamic weight pipeline that eliminates the ~3.7s recompile-every-10-steps
bottleneck. Weights are passed via IOSurface spatial dimension instead of
baked as constants, so kernels compile once at startup (345ms) and run
indefinitely without exec() restart.

Key components:
- training_dynamic/ — full pipeline (config, IO, MIL generators, train loop)
  - 9 dynamic kernels shared across all 12 layers
  - Vocab compaction 32K→9.2K for faster classifier
  - Vectorized cross-entropy with vDSP/NEON
  - Adam optimizer with gradient clipping + cosine LR schedule
  - Checkpoint save/resume

- test_dynamic_matmul.m — validates dynamic weight matmul vs cblas
- test_weight_patch.m — tests weight update via IOSurface

- dashboard.py — updated with --dynamic flag for v2 pipeline support,
  improved step regex parsing, --scratch/--lr/--accum CLI args

Performance: 110ms/step steady-state (no recompile overhead)
  ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5 ms
- Fix positional arg parsing (model_path, steps, lr were silently ignored)
- Add --model, --ckpt flags; forward ckpt_path across exec() restarts
- Add --no-ane-extras to disable ANE classifier/softmax/rmsnorm_bwd
- CPU fallback for softmax/classifier/rmsnorm_bwd when extras disabled
- Update README with 4-way benchmark comparison table (20 steps)
- Parse static pipeline JSON step/batch/perf lines for real-time updates
- Running elapsed time, ms/step from wall-clock timestamps, steps/sec
- Compute ANE + Total TFLOPS from FLOPs/step when not reported directly
- Support --ane (train_large_ane) and --no-ane-extras flags
- Dynamic pipeline timing breakdown + CKPT_PATH per mode
… MIL pipeline

[MLModel compileModelAtURL:] fails on macOS 26, breaking inmem_bench,
sram_bench, and sram_probe. This switches all three to generate MIL text
and weight blobs programmatically in memory (matching the working
inmem_peak.m approach), bypassing CoreML disk compilation entirely.

- inmem_bench.m: replace CoreML compile + file read with genMIL/buildWeightBlob
- sram_bench.m: switch from _ANEClient/_ANEModel to _ANEInMemoryModel API
- sram_probe.m: same _ANEClient → _ANEInMemoryModel conversion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Validate all fread() return values in model_load_weights (model.h)
- Check ane_eval() return values in ane_conv_eval (forward.h) and ane_eval_k (tiny_train.m)
- Log error details on ANE eval failure (ane_runtime.h)
- Thread-safe RMSNorm: replace global g_rms_tmp with local allocation (stories_cpu_ops.h)
- Bounds-check token indices in cross_entropy_loss, embed_lookup, embed_backward
- Atomic checkpoint writes via tmp+rename pattern (tiny_train.m)
- Non-destructive recompile: compile new kernels first, swap only on success (model.h)
- Validate fread() in load_checkpoint (tiny_train.m)
Updated README to reflect project scope, architecture, and limitations.
…ort-dataset-underflow-fix

Fix token sampling underflow for short token datasets
Fix docs: add training data download instructions
Optimize dashboard and prevent sudo hang when password needed
…hmarks

Fix benchmarks for macOS 26: replace compileModelAtURL with in-memory MIL
…ta-paths

Fix hardcoded TinyStories data path in train_large/train_large_ane
…ctness

fix: correctness and safety improvements for training
Follow-up to PR maderix#31 — assert() aborts on bad tokens, which is too
harsh for training. Skip bad tokens with a warning instead.
Community-submitted results for M1 Pro/Max, M3 Pro, M4 Pro/Max, M5.
Includes training performance, peak throughput, MIL compatibility
matrix, and structured JSON data.
All chips have 16 NE cores except Ultra (32 via UltraFusion).
M4 38 TOPS is INT8/mixed-precision, not comparable to M3 FP16 spec.
Benchmark report now includes full Stories110M model configuration
(arch, layers, dims, kernels). README updated: 12-layer results
replace stale single-layer numbers, limitations reflect current state.
New files:
- model_config.h: Parameterized model config with presets (Stories42M/110M, LLaMA-1B/7B),
  pipeline planning (compute_pipeline_plan), memory/FLOP estimation
- pipeline.h: Layer-group scheduler (PipelineScheduler state machine),
  compile budget tracking, mmap-based cross-exec() shared tensor state,
  exec() restart with automatic resume
- gradient_checkpoint.h: Activation checkpointing policies (ALL/BOUNDARY/SQRT/NONE),
  recompute tracking, memory savings estimation
- train_pipeline.m: Entry point with dry-run simulation mode -- prints full execution
  plan for any model config, simulates scheduler state machine
- Makefile: train_pipeline and train_pipeline_live targets

All additive -- existing train_large.m untouched.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
…tests

- model_config.h: Added headroom_pct field to CompileConfig, used in
  max_layers_per_compile() with validation (falls back to 10% for invalid
  values). All presets include default. --headroom CLI flag added.
- pipeline.h: Tightened mmap error handling — calloc checks, size
  validation in mmap_state_open (file size vs header, truncation
  detection), sentinel/version in error message, msync/munmap return
  checks in close.
- test_pipeline_unit.c: 23 unit tests for model_config, pipeline
  planning, gradient checkpoint, and FLOP estimation. Pure C, no ANE
  dependency. All passing.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
…ency, safety guards

Bug fix: n_checkpointed count was wrong in CKPT_BOUNDARY/SQRT/EVERY_N
  - Replaced per-policy arithmetic with single post-switch loop that counts
    actual is_saved bits. Eliminates edge-case miscounts when last layer
    falls on an interval boundary.

Inconsistency: headroom mismatch between planner and runtime budget
  - budget_init() now takes CompileConfig* and uses the same headroom_pct
    validation as max_layers_per_compile(). Both paths yield identical
    usable-budget calculations.

Inconsistency: total_model_bytes() omitted global gradients
  - Added rms_final_grad and embed_grad terms to match mmap_compute_size().
    Diagnostic output now agrees with actual allocation.

Design: divide-by-zero in model_dims_init() if n_heads=0
  - Guarded head_dim = dim / n_heads with n_heads > 0 check.

Design: no bounds checking in mmap typed accessors
  - All four mmap_layer_* accessors now validate layer index and return NULL
    on out-of-bounds. Extracted shared mmap_dims() helper to deduplicate
    ModelDims reconstruction.

Design: CKPT_EVERY_N interval was hardcoded; callers could not set it
  - Added custom_interval parameter to checkpoint_init(). Pass 0 for
    default (4), or any positive int for custom spacing.

Tests: 26/26 passing (3 new: custom interval, n_checkpointed accuracy,
zero-heads guard).

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>